COVERT: Configurable Virtual Redundancy with Transparent Availability on Commodity Software

Abstract

Overview

Scaling integrated circuit technology into the deep submicron regime is expected to increase both soft and hard error rates significantly [1]. Providing high availability in the presence of relatively unreliable components is therefore likely to become an increasingly important requirement for a diverse set of systems, including general-purpose commodity systems. Traditionally, high-availability systems have used specialized hardware and software that cater to a small subset of the application domain, such as banking and mission-critical applications. For example, the HP NonStop system [2] uses a combination of enhanced commodity hardware and specialized system software to create and manage the synchronization of redundant processes. The IBM zSeries systems [3] and Stratus [4] use custom-designed hardware, system software, and firmware to provide high availability. Custom-designed hardware and/or software is not a viable solution for the cost-competitive commodity market: both commodity hardware and software are resistant to significant changes, particularly if those changes affect performance in non-fault-tolerant configurations. Systems that use specialized system software are further restricted to running only applications written for a particular OS, e.g., the NonStop kernel [1] and VOS [3].

In this poster, we present a technique for enabling the use of "off the shelf" general-purpose software in high-availability systems running on general-purpose Chip Multiprocessors (CMPs). We propose enhancing a virtual machine monitor (VMM) so that it performs the same functions as a specialized high-availability operating system, e.g., NonStop, with no changes to any commodity software, including the OS, runtime software, and application software. We call this solution "COVERT": Configurable Virtual Redundancy with Transparent Availability.
The COVERT software manages the creation, synchronization, and output comparison (at the I/O level, similar to the NonStop OS) of redundant threads, and performs recovery in the event of a failure (Fig. 1(a)). The VMM approach used by COVERT provides transparent high availability to all conventional software: legacy, current, and future. The software is aided by configurable hardware that can provide fault isolation to the redundant threads [5].

The VMM creates a redundant replica by cloning the guest virtual machine that needs high availability. It then uses a state-machine-based approach to synchronize the duplicate VMs: the VMM must ensure that all inputs to both copies are identical and are delivered to the VMs when they are in an identical state. Given that the two VMs are identical and their inputs are identical, their outputs should also be identical unless a hardware error occurs, so a hardware error can be detected by comparing the outputs.

A significant design issue is maintaining identical inputs. We classify all inputs to the VMs as deterministic or non-deterministic. Deterministic inputs can be passed directly to the guest VMs, whereas non-deterministic inputs must be coordinated by COVERT so that they appear identical to the duplicated VMs. COVERT is also responsible for creating checkpoints and for comparing outputs to detect hardware errors.

The poster will present a detailed design of the COVERT software. We have evaluated the feasibility of the proposed design with an analytical model (Fig. 1(b)). Based on this model, we show that overheads for several compute-intensive and I/O-intensive benchmarks can be restricted to less than 20% (Fig. 1(c)).
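The replication idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the COVERT design: `ReplicaVM`, `Monitor`, `deliver`, `run_demo`, and the `fault_at_step` fault-injection knob are all hypothetical names, the "guest" is reduced to a single integer of state, and a monotonic clock read stands in for a non-deterministic input. The point it shows is that a non-deterministic input is sampled once and the same value is delivered to both replicas, after which any output mismatch signals a fault.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ReplicaVM:
    """Toy deterministic state machine standing in for a guest VM (hypothetical)."""
    state: int = 0
    steps: int = 0
    fault_at_step: Optional[int] = None  # hypothetical fault injection for the demo

    def step(self, value: int) -> int:
        """Consume one input and return the value the guest would send to I/O."""
        self.steps += 1
        self.state = (self.state * 31 + value) % 1_000_003
        if self.steps == self.fault_at_step:
            self.state ^= 1  # simulate a single bit flip in guest state
        return self.state


class Monitor:
    """Toy VMM layer: feeds identical inputs to both replicas and compares outputs."""

    def __init__(self, primary: ReplicaVM, backup: ReplicaVM) -> None:
        self.primary = primary
        self.backup = backup

    def deliver(self, value: int,
                source: Optional[Callable[[], int]] = None) -> Tuple[int, bool]:
        # A non-deterministic input is sampled exactly once, so that both
        # replicas observe the same value; deterministic inputs pass through.
        if source is not None:
            value = source()
        out_primary = self.primary.step(value)
        out_backup = self.backup.step(value)
        # Output comparison: a mismatch indicates that one replica diverged.
        return out_primary, out_primary == out_backup


def run_demo(fault_at_step: Optional[int] = None) -> List[bool]:
    """Run three inputs through the replica pair and report per-step agreement."""
    monitor = Monitor(ReplicaVM(), ReplicaVM(fault_at_step=fault_at_step))
    inputs = [(7, None), (0, time.monotonic_ns), (11, None)]
    return [monitor.deliver(value, source)[1] for value, source in inputs]


if __name__ == "__main__":
    print(run_demo())                  # fault-free run
    print(run_demo(fault_at_step=2))   # bit flip injected in the backup at step 2
```

In the fault-free run the replicas agree at every step even though the second input comes from a clock, because the clock is read once by the monitor rather than once per replica. With a fault injected, the comparison fails from the faulty step onward; a real system would stop at the first mismatch and recover, for example from a checkpoint, which this sketch does not model.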


Similar resources

Memory Resource Management in VMware ESX Server

VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified commodity operating systems. This paper introduces several novel ESX Server mechanisms and policies for managing memory. A ballooning technique reclaims the pages considered least valuable by the operating system running in a virtual machine. An idle memory t...


Object Oriented Metrics: Precision Tools and Configurable Visualisations

Software metrics are a valuable tool in helping software engineers to develop large, complex software systems. However, it is vital that transparency and precision are maintained at all stages. We contend that without grammars we cannot define metrics rigorously, without transparent and powerful parsing tools we cannot collect data accurately and without flexible configurable visualisation we c...


Transparent Fault-Tolerant Java Virtual Machine

Replication is one of the prominent approaches for obtaining fault tolerance. Implementing replication on commodity hardware and in a transparent fashion, i.e., without changing the programming model, has many challenges. Deciding at what level to implement the replication has ramifications on development costs and portability of the programs. Other difficulties lie in the coordination of the c...


Dependable ≠ Unaffordable

This paper presents a software architecture for hardware fault tolerance based on loosely-synchronized, redundant virtual machines (LSRVM). LSRVM will provide high levels of reliability by tolerating hardware faults at all levels of the system. Historically, such hardware fault tolerance has only been achievable using custom-designed hardware and proprietary operating systems. Today, however, te...


Comparing Parallel Simulated Annealing, Parallel Vibrating Damp Optimization and Genetic Algorithm for Joint Redundancy-Availability Problems in a Series-Parallel System with Multi-State Components

In this paper, we study different methods of solving joint redundancy-availability optimization for series-parallel systems with multi-state components. We analyzed various effective factors on system availability in order to determine the optimum number and version of components in each sub-system and consider the effects of improving failure rates of each component in each sub-system and impr...



Publication date: 2008